Fix jl_object_id__cold segfault with race condition defence #635

Open
wants to merge 15 commits into main from fix-segfault

Conversation

MilesCranmer (Contributor) commented Jul 6, 2025

@cjdoris possible good news for JuliaLang/julia#58171. See below.

Out of all of these changes, I think the crucial one is PYJLVALUES. But the others are good defensive patterns, so I think we should keep them anyway.

I can now run

python -c 'from juliacall import Main as jl; [jl.Ref(jl.randn(1)) for _ in range(1000000)]'

without getting a segfault on Python 3.13.
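For reference, the crucial change amounts to wrapping the offending global in Base.Lockable. A minimal sketch of the pattern, with an illustrative helper and element type (not the PR's actual code):

import Base: Lockable

# Sketch only: wrap the previously bare global so that concurrent mutation
# from different Julia threads cannot corrupt it.
const PYJLVALUES = Lockable(Any[])

# Hypothetical helper for illustration: every access goes through lock().
function store_value(v)
    lock(PYJLVALUES) do values
        push!(values, v)
        return length(values)
    end
end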

mkitti (Member) commented Jul 6, 2025

This suggests we do need a Julia-side lock for the GIL.

#627

Maybe we should try that first?

dpinol (Contributor) commented Jul 6, 2025

Cool, I'm running my workloads to see how it behaves 🙏

MilesCranmer (Contributor, Author)

@mkitti I think that might be independent of this issue? Or maybe an example would help.

MilesCranmer (Contributor, Author)

@mkitti AFAICT the GIL.@lock/GIL.@unlock macros do not come into play in the segfault we were seeing, so I am pretty sure these issues are orthogonal.

mkitti (Member) commented Jul 6, 2025

You essentially have put a ReentrantLock on a bunch of data structures. Are the data structures themselves not thread safe?

These are Julian data structures, so this is unlikely.

What has been thread unsafe is Python and the GIL locking mechanisms.

Rather than creating a bunch of locks, we need one central one around the GIL locking mechanism.

MilesCranmer (Contributor, Author)

You essentially have put a ReentrantLock on a bunch of data structures. Are the data structures themselves not thread safe?

Array is not thread safe for writes
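For instance, a minimal self-contained illustration (not from the PR) of why a bare global Vector is unsafe under concurrent writes:

# Racy: push! updates the length and storage non-atomically, so concurrent
# writers can crash Julia or silently lose/duplicate elements.
v = Int[]
Threads.@threads for i in 1:100_000
    push!(v, i)
end
length(v)   # typically != 100_000, and the loop can also segfault outright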

What has been thread unsafe is Python and the GIL locking mechanisms.

Rather than creating a bunch of locks, we need one central one around the GIL locking mechanism.

Can you give an example? I am not sure I follow. There is no manual GIL manipulation in the segfault I am seeing; i.e., nowhere in the code path is there a PyEval_SaveThread, as far as I can tell.

vtjnash left a comment

It looks like many of these functions would mandate holding the Python lock (GIL). Holding (or having) a Julia lock also seems a recipe for deadlock and concurrency problems long term, though it might fix some of the crashes short term.

mkitti (Member) commented Jul 6, 2025

Essentially, rather than adding a bunch of locks, let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time.

MilesCranmer (Contributor, Author)

Doesn't the Python GIL already result in this being true?

MilesCranmer (Contributor, Author)

@dpinol did it work btw?

mkitti (Member) commented Jul 6, 2025

Doesn't the Python GIL already result in this being true?

The GIL guarantees that only one Python thread is executing at any point in time. It does not guarantee that only one Julia thread will try to interact with it at a time.

MilesCranmer (Contributor, Author) left a comment

I guess I'm confused by this:

let's just add one ReentrantLock in a central location so that only one Julia thread can communicate with Python at a time

Because multiple Julia threads wouldn't be able to communicate with Python anyway, right? So we only need thread safety for Julia objects, which is what this PR fixes. Does that make sense, or am I not understanding something?

MilesCranmer force-pushed the fix-segfault branch 2 times, most recently from cd9f5c9 to d4a94f9 on July 6, 2025 at 18:39
vtjnash commented Jul 6, 2025

This PR looks unsound, since you need to decide whether holding this lock means it is forbidden to access any Python state, or whether holding the Python lock means it is forbidden to access any state protected by this lock. You cannot mix and match and hold both locks, unless you guarantee you always acquire the Python one first, in which case the Julia one is redundant and unnecessary. This is the "dining philosophers" problem of deadlock.

visr (Contributor) commented Jul 6, 2025

@visr this PR is meant to solve a different segfault

I understand, and don't mean to snipe you into solving that as well, but as (I believe) a counterexample to what you said earlier:

Because multiple Julia threads wouldn't be able to communicate to Python anyways, right. So we only need thread safety for Julia objects, which is what this PR fixes.

And to nudge toward what could perhaps be a more general solution, as Mark and Jameson seem to be pointing to as well.

MilesCranmer (Contributor, Author)

If you're interested, feel free to try. This PR is an imperfect but safe solution to a very specific segfault (jl_object_id__cold) when running Julia from Python via juliacall. This segfault has been impacting PySR and other Python libraries for over a year, basically preventing us from even starting Julia on Python 3.13+. It is an extremely high priority fix. I'd be happy to help think about a more general approach down the road, but for now, others would need to push that direction, as I just don't have the time.

vtjnash commented Jul 6, 2025

Any function that accesses a Python object or Python state is not allowed to use a Julia lock, as it must be using the GIL exclusively. Using a Julia Lockable there may make those functions unsound (UB). This deserves some documentation in your PR to indicate which locks are permitted to be held at the same time and which are forbidden (UB) to acquire on the same thread at the same time.

MilesCranmer (Contributor, Author)

@vtjnash I don't think these concerns apply here. This PR addresses a very specific and immediate segfault on the Python=>Julia side. These particular instances of lockables do not create a Julia-Python lock-ordering issue.

Currently, the data structures they guard get corrupted without the locking. This is a real, immediate, and severe problem, rather than something theoretical. To emphasise: Julia crashes immediately on Python 3.13, on both macOS and Windows.

I fully support better long-term solutions (contributions welcome!), but I don't want to let perfect be the enemy of good here. I strongly prefer fixing the empirical segfault first, then tackling theoretical concerns afterward.

vtjnash commented Jul 7, 2025

I don't mean to imply these UB are theoretical concerns, since deadlocks are just as bad for the user and even worse for CI reliability. If you cannot even document the API requirements, then this PR seems like just added technical debt that someone in the future needs to undo. According to the existing documentation, before calling these functions it is mandatory for the caller to have already acquired the GIL (

Most importantly, you can only call Python code while Python's
[Global Interpreter Lock (GIL)](https://docs.python.org/3/glossary.html#term-global-interpreter-lock)
is locked by the current thread. You can use JuliaCall from any Python thread, and the GIL
will be locked whenever any JuliaCall function is used. However, to leverage the benefits
of multi-threading, you can unlock the GIL while executing any Julia code that does not
interact with Python.
), which means the current documentation is stating this is an API misuse and that these structures are not supposed to have internal locks, but rather are supposed to be protected by the GIL

vtjnash commented Jul 7, 2025

In particular, the lock you added here to fix the crash appears to be the wrong lock to be using in the functions where it is used, since they seem to be direct calls into the Python runtime (which require holding the GIL). So you are not actually fixing the cause of the segfault, but might be making it harder for someone to diagnose and fix correctly later (the fix, looking at the code, appears to be requiring that the GIL is held, which is what the package documentation says all functions will do).

    ccall(UnsafePtr{C.PyTypeObject}(C.Py_Type(o)).free[!], Cvoid, (C.PyPtr,), o)

Since the documentation says every function should either already have or will acquire the GIL, it is strictly forbidden to introduce any new locks into this package (like this PR does), except for functions that do not interact with any python objects, since any lock acquired between the two recursive lock acquires would cause a violation of lock ordering.
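To make the ordering concern concrete, here is a hedged, self-contained sketch where plain ReentrantLocks stand in for the GIL and for a new Julia-side lock (this is not PythonCall code):

const FAKE_GIL = ReentrantLock()   # stands in for Python's GIL
const JL_LOCK  = ReentrantLock()   # stands in for a lock added by this PR

# Task A enters "holding the GIL" and then needs the Julia lock.
function a()
    lock(FAKE_GIL) do
        sleep(1)
        lock(() -> nothing, JL_LOCK)
    end
end

# Task B takes the Julia lock first and then calls back "into Python".
function b()
    lock(JL_LOCK) do
        sleep(1)
        lock(() -> nothing, FAKE_GIL)
    end
end

Threads.@spawn a(); Threads.@spawn b()   # ABBA ordering: both tasks block forever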

MilesCranmer (Contributor, Author)

Can you illustrate your concerns with an example?

MilesCranmer (Contributor, Author)

I think worrying about API contracts when Julia won’t even start without a crash is like worrying about getting enough vegetables when your heart is stopped. Let’s do CPR first and worry about diet later.

mkitti (Member) commented Jul 7, 2025

This is the deadlock scenario when you have multiple locks in play.

julia> import Base: Lockable

julia> const PYJLVALUES = Lockable([]);

julia> const PYJLFREEVALUES = Lockable([]);

julia> function f()
           lock(PYJLVALUES)
           println("Got PYJLVALUES")
           sleep(1)
           lock(PYJLFREEVALUES)
           println("f() got both locks")
           unlock(PYJLFREEVALUES)
           unlock(PYJLVALUES)
       end
f (generic function with 1 method)

julia> function g()
           lock(PYJLFREEVALUES)
           println("Got PYJLFREEVALUES")
           sleep(1)
           lock(PYJLVALUES)
           println("g() got both locks")
           unlock(PYJLVALUES)
           unlock(PYJLFREEVALUES)
       end
g (generic function with 1 method)

julia> Threads.@spawn f(); Threads.@spawn g()
Got PYJLVALUESTask (runnable, started) @0x00000242bbaaf080
Got PYJLFREEVALUES

# Deadlock

whereas if we had a single lock

julia> import Base: Lockable

julia> const _jl_gil_lock = ReentrantLock();

julia> const PYJLVALUES = Lockable([], _jl_gil_lock);

julia> const PYJLFREEVALUES = Lockable([], _jl_gil_lock);

julia> function f()
           lock(PYJLVALUES)
           println("Got PYJLVALUES")
           sleep(1)
           lock(PYJLFREEVALUES)
           println("f() got both locks")
           unlock(PYJLFREEVALUES)
           unlock(PYJLVALUES)
       end
f (generic function with 1 method)

julia> function g()
           lock(PYJLFREEVALUES)
           println("Got PYJLFREEVALUES")
           sleep(1)
           lock(PYJLVALUES)
           println("g() got both locks")
           unlock(PYJLVALUES)
           unlock(PYJLFREEVALUES)
       end
g (generic function with 1 method)

julia> Threads.@spawn f(); Threads.@spawn g()
Task (runnable, started) @0x0000019a94e809c0

julia> Got PYJLVALUES
f() got both locks
Got PYJLFREEVALUES
g() got both locks

julia>

MilesCranmer (Contributor, Author)

Added it. Once one of the PRs merges, GLOBAL_LOCK can be renamed to _jl_gil_lock.

vtjnash commented Jul 7, 2025

That would still introduce different unsoundness, since here you're causing deadlock / recursion mistakes, and with the rename, you'd be failing to call disable_finalizers where required for avoiding memory corruption (ReentrantLock includes that guard).

MilesCranmer (Contributor, Author)

Reverted the global lock.

The original code with separate locks doesn't appear to have deadlock issues. @mkitti's example isn't possible, as those globals are behind a shared reentrant lock.

If there are other genuine scenarios I'm missing, a concrete example would greatly help. Right now, broad concerns about unsoundness aren't giving me enough to go on.

cjdoris (Collaborator) commented Jul 10, 2025

@MilesCranmer thanks for the PR, I'm glad we're closer to knowing what's going on.

I somewhat agree with @mkitti, though: I don't think we should make every mutable global in sight into a Lockable. It makes the package slower and more complicated to maintain, and anyway I don't understand why this should be necessary. For now, can we just lock PYJLVALUES if that's the culprit for the current issue?

The fact that Lockable fixes things implies there is some race condition on PYJLVALUES, but I had thought that only one task or thread should be able to access this at a time anyway.

However, we now have an opportunity to diagnose the problem by inspecting activity of this lock. In particular, the times when we had to wait for the lock to be freed are the times there would have been illegal concurrent access. I suggest printing the stacktrace every time the lock is locked, and also whether or not we had to wait (which you can do with trylock).
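Roughly, that diagnostic could look like this (a sketch assuming a plain ReentrantLock; names are illustrative, not PR code):

# Sketch: record whether we had to wait for the lock, and from where.
function diagnostic_lock(f, l::ReentrantLock)
    contended = !trylock(l)     # true means another task/thread held the lock
    contended && lock(l)        # fall back to a blocking acquire
    try
        contended && @warn "Lock was contended" st = stacktrace()
        return f()
    finally
        unlock(l)
    end
end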

I appreciate you want a quick fix but hopefully the above will shed some light on the underlying issue.

MilesCranmer (Contributor, Author) commented Jul 10, 2025

I'm happy to try. However, regarding:

I don't think we should make every mutable global in sight into a Lockable

I want to push back on this. I think defensive programming against thread races in this exact pattern (with rare exceptions) is a very important software engineering habit. Julia data structures are not thread safe internally, so I believe it's incredibly important that any thread race should either (1) error loudly or (2) be safely avoided. The worst option, (3) quiet corruption, is what unlocked global mutables are always exposed to.

If code is already thread safe, then locking will have negligible performance effect, so there's no problem! But if code is not thread safe, then you really, really, really want locks; otherwise we get the worst classes of bugs imaginable. (Which I have many scars to show for!) And even if code doesn't use multithreading on those parts now, someone might add it in the future. Or perhaps someone starts multithreading over nested PythonCall code in an outer loop somewhere. For example, one time I segfaulted because I had two Wandb.jl loggers running in parallel in SymbolicRegression.jl (Wandb.jl uses PythonCall.jl internally); who knows if they had simply hit one of these internal functions at the same time, or if it was something else.

If performance is a worry, which I really do not think it will be, I would recommend we make up some ErrorLock that simply errors whenever there is an attempted unlock from a second thread. That way we can be sure that any thread race, even if none are happening now, will be very obvious in the future.
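A hedged sketch of one way such a lock could work (not necessarily what this PR ends up implementing): instead of blocking, it throws whenever another task already holds it, so any concurrent access surfaces loudly rather than corrupting state.

# Sketch only: a lock that refuses to wait. A would-be race becomes an
# immediate, visible error instead of silent corruption.
struct ErrorLock <: Base.AbstractLock
    inner::ReentrantLock
    ErrorLock() = new(ReentrantLock())
end

function Base.lock(l::ErrorLock)
    # trylock still succeeds for re-entrant acquires by the same task
    trylock(l.inner) || error("ErrorLock: already held by another task (possible thread race)")
    return nothing
end

Base.unlock(l::ErrorLock) = unlock(l.inner)
Base.trylock(l::ErrorLock) = trylock(l.inner)
Base.islocked(l::ErrorLock) = islocked(l.inner)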

MilesCranmer (Contributor, Author)

Actually, how is this? I made an ErrorLock that will literally just error if you try to unlock a locked global from a different thread. So we should see errors if there are any thread races.

Also, today the segfault decided not to be reproducible on my local machine for whatever reason. I still see it in CI daily of course, though it is random. Not sure, maybe I just need to open up more applications to see it.

@cjdoris can you get it in WSL?

In any case, since none of these ErrorLocks seem to be erroring as far as I can tell, I guess there's no thread race??? What??

mkitti (Member) commented Jul 10, 2025

What I think we might need to do is

  1. Unlock the GIL on init by default as suggested by @visr here:
    Segfault on propertynames #586 (comment)
  2. Have every public or exported API function lock the GIL.
  3. Provide non-locking versions of functions in a clearly marked module or perhaps even split this off into another package.
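For item 2, a rough sketch under the assumption that PythonCall's GIL.@lock macro (mentioned earlier in this thread) acquires the GIL around an expression; the wrapper name is illustrative, not actual PythonCall source:

using PythonCall: GIL, pygetattr

# Illustrative only: a public entry point re-acquires the GIL before touching
# any Python state, so callers never need to manage the GIL themselves.
function safe_getattr(o, k)
    GIL.@lock pygetattr(o, k)
end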

cjdoris (Collaborator) commented Jul 11, 2025

@MilesCranmer OK yeah, the ErrorLock looks good; just need to keep trying to reproduce the error to see what it reveals. I think it would be good to save off the stacktrace whenever it is locked too, and include it in the error, so that we can compare which code locked the lock and which code was trying to lock it.

I still haven't been able to reproduce it on WSL (or anywhere locally) since that first day. 😞 Just need to keep trying...

cjdoris (Collaborator) commented Jul 11, 2025

What I think we might need to do is

  1. Unlock the GIL on init by default as suggested by @visr here:
    Segfault on propertynames #586 (comment)
  2. Have every public or exported API function lock the GIL.
  3. Provide non-locking versions of functions in a clearly marked module or perhaps even split this off into another package.

Yep, I've been wondering the same. It would be good to make a PR for this so we can investigate. I don't know what the performance hit will be.

It's not clear it will fix the present issue, but it might (hopefully ErrorLockable will shed some light).

Fundamentally, Julia will happily run any task on any thread, which is very much at odds with Python's single-threaded nature (the GIL), so something like this might be necessary. It won't be a small PR!

dpinol (Contributor) commented Jul 11, 2025

I'll run my workload with the latest commit.

dpinol (Contributor) commented Jul 12, 2025

I could not make it fail after multiple executions. I guess it depends on the machine load, because some days it crashes consistently, and on others it never fails.

MilesCranmer (Contributor, Author)

Interesting… so none of us can reproduce it today.
